Predicting Disease Risks Using Feature Selection Based on Random Forest and Support Vector Machine

نویسندگان

  • Jing Yang
  • Dengju Yao
  • Xiaojuan Zhan
  • Xiaorong Zhan
چکیده

Disease risk prediction is an important task in biomedicine and bioinformatics. To resolve the problem of high-dimensional features space and highly feature redundancy and to improve the intelligibility of data mining results, a new wrapper method of feature selection based on random forest variables importance measures and support vector machine was proposed. The proposed method combined sequence backward searching approach and sequence forward searching approach. Feature selection starts with the entire set of features in the dataset. At every iteration, two feature subsets are gained. One feature subset removes those most unimportant features and the most important feature at the same time, which is used to train random forest and to compute feature importance for next feature selection. Another feature subset removes only those most unimportant features while remains the most important feature, which is used as the optimal feature subset to train SVM classifier. Finally, the feature subset with the highest SVM classification accuracy was regarded as optimal feature subset. The experimental results on 11 UCI datasets, a real clinical data sets and a gene expression dataset show that the proposed algorithm can generate the smaller feature subset while improve the classification accuracy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Random Forest Classifier based on Genetic Algorithm for Cardiovascular Diseases Diagnosis (RESEARCH NOTE)

Machine learning-based classification techniques provide support for the decision making process in the field of healthcare, especially in disease diagnosis, prognosis and screening. Healthcare datasets are voluminous in nature and their high dimensionality problem comprises in terms of slower learning rate and higher computational cost. Feature selection is expected to deal with the high dimen...

متن کامل

Prognosis of multiple sclerosis disease using data mining approaches random forest and support vector machine based on genetic algorithm

Background: Multiple sclerosis (MS) is a degenerative inflammatory disease which is most commonly diagnosed by magnetic resonance imaging (MRI). But, since the MRI device uses of a magnetic field, if there are metal objects in the patient's body, it can disrupt the health of the patient, the functioning of the MRI, and distortion in the images. Due to limitations of using MRI device, screening ...

متن کامل

Improvement of Support Vector Machine and Random Forest Algorithm in Predicting Khorramabad River Flow Uusing Non-uniform De-Noising of data and Simplex Algorithm

In this study, in order to simulate the monthly flow of the Khorramabad River, the time series of this river was decomposed into three levels using the wavelet of Daubechies-3, during the period of 1955-2014. Based on this, it was found that there is a Non-uniform noise that includes two periods of time in this signal, with the October 2008 border which required that the signal be become non-un...

متن کامل

Predicting the cause of kidney stones in patients using random forest, support vector machine and neural network

Background: Today, with the advancement of technology in various fields, the importance of recording data in the field of health is increasing so much that for many diseases around the world, including kidney disease, registration systems have been set up. This is happening in our country and in the future, the number of these systems will increase. The medical data set contains valuable inform...

متن کامل

Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine

We can reach by DNA microarray gene expression to such wealth of information with thousands of variables (genes). Analysis of this information can show genetic reasons of disease and tumor differences. In this study we try to reduce high-dimensional data by statistical method to select valuable genes with high impact as biomarkers and then classify ovarian tumor based on gene expression data of...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014